Jane Street Market Prediction
Buy low, sell high. It sounds so easy….
1% Better Everyday
-
I see that autoencoders are widely used in this competition. However, I have some doubts about whether the autoencoded features really help the model. As shown in the notebook, the compressed features are concatenated back into the original data as extra features. Personally, I've tried 2~3 different random seeds and the variation in the results is large. It makes me wonder whether the autoencoder is only useful on certain features, so dropping the others from the autoencoder might be a good idea.
-
Maybe we can feed the autoencoded features to non-NN models, e.g. XGBoost, CatBoost, CBM? reference
-
Maybe we need to remove correlated features before generating autoencoded features.
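That last suggestion can be sketched with a simple pairwise-correlation filter. This is a minimal sketch on toy data; `drop_correlated`, the column names, and the 0.95 threshold are illustrative choices, not from the notebook:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the anonymized features (names are illustrative).
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 3))
df = pd.DataFrame({
    "feature_0": base[:, 0],
    # Near-duplicate of feature_0, to simulate a highly correlated pair.
    "feature_1": base[:, 0] * 0.99 + rng.normal(scale=0.01, size=500),
    "feature_2": base[:, 1],
    "feature_3": base[:, 2],
})

def drop_correlated(frame, threshold=0.95):
    """Drop one feature from each pair whose |correlation| exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so each pair is considered once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop), to_drop

reduced, dropped = drop_correlated(df)
```

The surviving columns would then be what gets passed into the autoencoder.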
Overview
The efficient market hypothesis posits that markets cannot be beaten because asset prices will always reflect the fundamental value of the assets. In a perfectly efficient market, buyers and sellers would have all the agency and information needed to make rational trading decisions.
In reality, financial markets are not efficient. The purpose of this trading model is to identify arbitrage opportunities to "buy low and sell high". In other words, we exploit market inefficiencies to identify profitable trades and decide whether to execute them.
The dataset, provided by Jane Street, contains an anonymized set of 130 features representing real stock market data. Each row in the dataset represents a trading opportunity, for which I predict an action value: 1 to make the trade and 0 to pass on it. Due to the high dimensionality of the dataset, I use Principal Component Analysis (PCA) to identify features to be used for supervised learning. The intuition is to compress the dataset and use it more efficiently. I then use XGBoost (extreme gradient boosting) - a hugely popular ML library due to its superior execution speed and model performance - to predict profitable trades. I also use Optuna (an automatic hyperparameter optimization software framework) to tune the hyperparameters of the classification model.
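The PCA compression step can be sketched as follows. This is a minimal sketch on synthetic data; in the notebook the real training frame would take the place of `X`, and the retained components are what would be fed to the XGBoost classifier tuned with Optuna:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the anonymized features: 20 observed columns
# generated from only 5 latent factors, so PCA can compress heavily.
rng = np.random.default_rng(42)
latent = rng.normal(size=(1000, 5))
mixing = rng.normal(size=(5, 20))
X = latent @ mixing + rng.normal(scale=0.01, size=(1000, 20))

# Passing a float in (0, 1) tells sklearn to keep just enough
# components to explain that fraction of the total variance.
pca = PCA(n_components=0.95).fit(X)
X_compressed = pca.transform(X)
```

`X_compressed` is the lower-dimensional representation used downstream in place of the raw feature columns.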
import numpy as np
import pandas as pd
import seaborn as sns
import albumentations as A
import matplotlib.pyplot as plt
import os, gc, cv2, random, warnings
import re, math, sys, json, pprint, pdb
import tensorflow as tf
from tensorflow.keras import backend as K
import tensorflow_hub as hub
from sklearn.model_selection import train_test_split
import dabl
import datatable as dt
import kerastuner as kt
import altair as alt
alt.data_transformers.enable('default', max_rows=None)
warnings.simplefilter('ignore')
print(f"Using TensorFlow v{tf.__version__}")
#@title Notebook type { run: "auto", display-mode:"form" }
SEED = 10120919
DEBUG = False #@param {type:"boolean"}
TRAIN = True #@param {type:"boolean"}
def seed_everything(seed=0):
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    os.environ['TF_DETERMINISTIC_OPS'] = '1'
GOOGLE = 'google.colab' in str(get_ipython())
KAGGLE = not GOOGLE
seed_everything(SEED)
print("Running on {}!".format(
"Google Colab" if GOOGLE else "Kaggle Kernel"
))
- Anonymized set of features, `feature_{0...129}`, representing real stock market data.
- Each row in the dataset represents a trading opportunity, for which you will be predicting an action value: 1 to make the trade and 0 to pass on it.
- In the training set, you are provided a `resp` value, as well as several other `resp_{1,2,3,4}` values that represent returns over different time horizons. These variables are not included in the test set.
- Trades with `weight = 0` were intentionally included in the dataset for completeness, although such trades will not contribute towards the scoring evaluation.
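Given those rules, the training target can be set up along the following lines. This is a hedged sketch: the column names follow the competition schema, but dropping `weight == 0` rows and labeling positive-`resp` rows as trades is a common modeling choice, not something dictated by the data:

```python
import pandas as pd

# Tiny stand-in for the training rows (column names follow the schema).
train = pd.DataFrame({
    "weight": [0.0, 1.2, 0.7, 3.1],
    "resp":   [0.05, -0.01, 0.02, 0.00],
})

# weight == 0 rows never contribute to the score, so they are often dropped.
train = train[train["weight"] > 0].copy()

# One common way to build a binary target: trade when the return is positive.
train["action"] = (train["resp"] > 0).astype(int)
```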
First off, since this is a large file to ingest, I will use datatable to load the data and then convert it to a pandas dataframe.
%%time
train_dt = dt.fread(f"{input_path}train.csv")
%%time
train_df = train_dt.to_pandas()
The dataframe is 2390490 rows by 138 columns
train_df.info()
train_df.head()
-
Intraday trading patterns in high-frequency algorithmic trading are far more linked than interday patterns, which is why I have decided to look at only one day of data for EDA.
-
Since not enough information is given about these features, I am assuming that
`ts_id` here represents the packet number of any commodity from any stock exchange that was used while creating this data. -
These packets can be of any type; usually, the main types of packets we receive for different commodities from a stock exchange are Trade, Delete, Update, and Reduce messages/packets about changes in the limit order book.
train_df.sort_values(by=['date','ts_id'], inplace=True)
After sorting data by date, let's look at only the first day of data
sample_df = train_df.query('date == 0')
sample_df.describe()
nan_df = (sample_df.describe()
.iloc[0,:]
.apply(lambda x : len(sample_df)-x)
.to_frame()
.reset_index()
.query('count > 100'))
alt.Chart(nan_df).mark_bar().encode(
x='count',
y="index",
).properties(width=600)
From the bar chart shown above, there are lots of missing values. I simply replace the NaNs with each feature's mean value, because of the mean reversion strategy that I think has been used here.
sample_df = sample_df.apply(lambda x: x.fillna(x.mean()), axis=0)
Here I am plotting histograms of some of the features present in this dataset.
alt.Chart(sample_df).transform_fold(
['feature_0', 'feature_1', 'feature_2'],
as_=['Experiment', 'Measurement']
).mark_area(
opacity=.3,
interpolate='step',
).encode(
alt.X("Measurement:Q", bin=alt.Bin(maxbins=100)),
alt.Y('count()', stack=None),
alt.Color('Experiment:N')
)